5 research outputs found

    High-level synthesis of fine-grained weakly consistent C concurrency

    Get PDF
    High-level synthesis (HLS) is the process of automatically compiling high-level programs into a netlist (collection of gates). Given an input program, HLS tools exploit its inherent parallelism and pipelining opportunities to generate efficient customised hardware. C-based programs are the most popular input for HLS tools, but these tools historically only synthesise sequential C programs. As the appeal for software concurrency rises, HLS tools are beginning to synthesise concurrent C programs, such as C/C++ pthreads and OpenCL. Although supporting software concurrency leads to better hardware parallelism, shared memory synchronisation is typically serialised to ensure correct memory behaviour, via locks. Locks are safety resources that ensure exclusive access of shared memory, eliminating data races and providing synchronisation guarantees for programmers.  As an alternative to lock-based synchronisation, the C memory model also defines the possibility of lock-free synchronisation via fine-grained atomic operations (`atomics'). However, most HLS tools either do not support atomics at all or implement atomics using locks. Instead, we treat the synthesis of atomics as a scheduling problem. We show that we can augment the intra-thread memory constraints during memory scheduling of concurrent programs to support atomics. On average, hardware generated by our method is 7.5x faster than the state-of-the-art, for our set of experiments. Our method of synthesising atomics enables several unique possibilities. Chiefly, we are capable of supporting weakly consistent (`weak') atomics, which necessitate fewer ordering constraints compared to sequentially consistent (SC) atomics. However, implementing weak atomics is complex and error-prone and hence we formally verify our methods via automated model checking to ensure our generated hardware is correct. Furthermore, since the C memory model defines memory behaviour globally, we can globally analyse the entire program to generate its memory constraints. Additionally, we can also support loop pipelining by extending our methods to generate inter-iteration memory constraints. On average, weak atomics, global analysis and loop pipelining improve performance by 1.6x, 3.4x and 1.4x respectively, for our set of experiments. Finally, we present a case study of a real-world example via an HLS-based Google PageRank algorithm, whose performance improves by 4.4x via lock-free streaming and work-stealing.Open Acces

    A Case for Leveraging 802.11p for Direct Phone-to-Phone Communications

    Get PDF
    WiFi cannot effectively handle the demands of device-to-device communication between phones, due to insufficient range and poor reliability. We make the case for using IEEE 802.11p DSRC instead, which has been adopted for vehicle-to-vehicle communications, providing lower latency and longer range. We demonstrate a prototype motivated by a novel fabrication process that deposits both III-V and CMOS devices on the same die. In our system prototype, the designed RF front-end is interfaced with a baseband processor on an FPGA, connected to Android phones. It consumes 0.02uJ/bit across 100m assuming free space. Application-level power control dramatically reduces power consumption by 47-56%.Singapore-MIT Alliance for Research and TechnologyAmerican Society for Engineering Education. National Defense Science and Engineering Graduate Fellowshi

    High-level synthesis of fine-grained weakly consistent C concurrency

    No full text
    High-level synthesis (HLS) is the process of automatically compiling high-level programs into a netlist (collection of gates). Given an input program, HLS tools exploit its inherent parallelism and pipelining opportunities to generate efficient customised hardware. C-based programs are the most popular input for HLS tools, but these tools historically only synthesise sequential C programs. As the appeal for software concurrency rises, HLS tools are beginning to synthesise concurrent C programs, such as C/C++ pthreads and OpenCL. Although supporting software concurrency leads to better hardware parallelism, shared memory synchronisation is typically serialised to ensure correct memory behaviour, via locks. Locks are safety resources that ensure exclusive access of shared memory, eliminating data races and providing synchronisation guarantees for programmers.  As an alternative to lock-based synchronisation, the C memory model also defines the possibility of lock-free synchronisation via fine-grained atomic operations (`atomics'). However, most HLS tools either do not support atomics at all or implement atomics using locks. Instead, we treat the synthesis of atomics as a scheduling problem. We show that we can augment the intra-thread memory constraints during memory scheduling of concurrent programs to support atomics. On average, hardware generated by our method is 7.5x faster than the state-of-the-art, for our set of experiments. Our method of synthesising atomics enables several unique possibilities. Chiefly, we are capable of supporting weakly consistent (`weak') atomics, which necessitate fewer ordering constraints compared to sequentially consistent (SC) atomics. However, implementing weak atomics is complex and error-prone and hence we formally verify our methods via automated model checking to ensure our generated hardware is correct. Furthermore, since the C memory model defines memory behaviour globally, we can globally analyse the entire program to generate its memory constraints. Additionally, we can also support loop pipelining by extending our methods to generate inter-iteration memory constraints. On average, weak atomics, global analysis and loop pipelining improve performance by 1.6x, 3.4x and 1.4x respectively, for our set of experiments. Finally, we present a case study of a real-world example via an HLS-based Google PageRank algorithm, whose performance improves by 4.4x via lock-free streaming and work-stealing.Open Acces

    Scheduling Weakly Consistent C Concurrency for Reconfigurable Hardware

    No full text

    Hardware Synthesis of Weakly Consistent C Concurrency

    Get PDF
    Hardware Synthesis of Weakly Consistency C Concurrency This webpage contains additional material for the paper Hardware Synthesis of Weakly Consistent C Concurrency (FPGA17paper.pdf). Contents Verifying that our scheduling constraints implement the C11 standard Verifying the lock-free circular buffer Motivating examples Experimental Data 1. Verifying that our scheduling constraints implement the C11 standard As explained in Section 4.4, we have used the Alloy tool to verify that our scheduling constraints are sufficient to implement the memory consistency model defined by the C11 standard. In alloy.zip, we provide the Alloy model files that we used. The first four are taken from Wickerson et al.'s work on comparing memory consistency models; the fifth is new. relations.als contains helper functions. exec.als encodes general program executions. exec_C.als encodes C11 program executions. c11.als encodes the constraints imposed by the C11 memory consistency model. question.als encodes our scheduling constraints, and several queries for checking that these scheduling constraints are strong enough to enforce the constraints required by the C11 memory consistency model. To reproduce our result, save the five files above into the same directory, download Alloy, open question.als in Alloy, and run the queries contained within. 2. Verifying the lock-free circular buffer As explained in Section 5.1, we have used the CppMem tool to verify that our case-study application, a lock-free circular buffer, is free from data races. In order to make the verification feasible, we removed the while-loop, replaced the array variable with a scalar, removed some unimportant local variables, and simplified the increment function. Below, we give the actual code that we verified. int main() { atomic_int tail = 0; atomic_int head = 0; int arr = 0; {{{ { int chead = head.load(memory_order_acquire); int ctail = tail.load(memory_order_relaxed); if (ctail+1 != chead) { arr = 42; tail.store(ctail+1, memory_order_release); } } ||| { int ctail = tail.load(memory_order_acquire); int chead = head.load(memory_order_relaxed); if (ctail != chead) { arr; head.store(chead+1, memory_order_release); } } }}} return 0; } To reproduce our result, paste the code above into CppMem's web interface and click Run. The tool should report (after a couple of minutes) that the code has 92 candidate executions, of which two are consistent, neither of which exhibit data races. 3. Motivating Examples We provide the actual code that displays coherence and message-passing violation in motivating-examples.zip, as described in Section 2 of our paper. These examples have been tested and verified on LegUp's VM. For convenience, we have included the output logs of each experiment for viewing. We include a "transcript" of the execution and a schedule trace, which can be visualised using LegUp's scheduleviewer. 4. Experimental Data We also provide the raw experiment data from Section 5 of our paper. We conduct two experiments, for which we provided raw data on all seven design points. chaining.csv contains the raw data for the chaining experiment. bursting.csv contains the raw data for the bursting experiment. The version number are from 1 to 7 for Unsound, SC, OMP criticals, SC atomics, Weak atomics, Mutexes and OMP atomics respectively
    corecore